Load the functions

source("thesis_functions.R")
dhist.plot <- function(d, v, vname){
  # `v` is an unquoted column name; embrace it with {{ }} so ggplot2/dplyr
  # resolve it inside the data frame (bare `v` / `d$v` would not work).
  x <- dplyr::pull(d, {{ v }})
  ggplot(d, aes(x = {{ v }})) + 
  geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(x), sd = sd(x)), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab(vname) + ylab("Density") +
  ggtitle(paste(vname, "Distribution")) +
  theme_bw()
}

Set the levels of the ordinal variables, and convert the other variables to their appropriate data types. The remaining date/time fields were not adjusted since they will not be used in this portion.

data <- fread("kpopdata.csv")
data <- mutate(data, ArtistType = as.factor(ArtistType),
               ArtistGender = as.factor(ArtistGender),
               ArtistGen = factor(ArtistGen),
               release_date = as.POSIXct(str_sub(release_date, 1, 10)),
               key = factor(key, levels = 0:11),
               mode = as.factor(mode),
               time_signature = factor(time_signature, levels = c(1,3,4,5)))

#make the release month/year a continuous variable: the number of months since the reference (~360 total)
ref.year <- 1992 #we are using January 1992 as the reference
data$months = 12*(year(data$release_date) - ref.year) + month(data$release_date)
#View(select(data, Artist, song_name, release_date, months))
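
As a quick arithmetic check of the month index (on a hypothetical date, not taken from the data): a release dated 2012-03-01 should map to 12*(2012 - 1992) + 3 = 243, which matches the first rows printed below.

```r
# Worked example of the month-index formula on a hypothetical date.
d <- as.POSIXct("2012-03-01")
12 * (as.integer(format(d, "%Y")) - 1992) + as.integer(format(d, "%m"))
## [1] 243
```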

Testing how to back-transform from the log transformation; the round trip below recovers the original values.

data$months[1:10]
##  [1] 243 243 243 243 243 243 243 240 345 309
log.trans <- log(360-data$months);log.trans[1:10]
##  [1] 4.762174 4.762174 4.762174 4.762174 4.762174 4.762174 4.762174 4.787492
##  [9] 2.708050 3.931826
back.trans <- 360 - exp(log.trans);back.trans[1:10]
##  [1] 243 243 243 243 243 243 243 240 345 309

Assessing normality of the Response Variable: Month of release date.

Original months values

Summary of the data to gain an understanding of the range.

summary(data$months)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0   241.0   285.0   266.3   317.0   349.0

Check the distribution of months.

#dhist.plot(data, months, "Month Released")
ggplot(data, aes(x=months)) +
  geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(data$months), sd = sd(data$months)), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released") + ylab("Density") +
  ggtitle(paste("Month Released", "Distribution")) +
  theme_bw()

#produce normal qqplot to assess normality
qqnorm(data$months, pch = 1, frame = FALSE)
qqline(data$months, col = "red", lwd = 2)

#Skewness score
skewness(data$months)
## [1] -1.342445

As we can see, the distribution of months is severely skewed to the left, so we try to find a transformation that normalizes it.

Log transform for moderately negatively skewed data

A common transformation towards normality for moderately negatively skewed data is \(log(max(y) + 1 - y)\); since the maximum number of months is 349, the transformation is \(log(350 - y)\).

summary of the data under this transformation

summary(log(350-data$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.497   4.174   4.029   4.691   5.849

Already from the five-number summary it is evident that skewness remains, since the median is much closer to the maximum than to the minimum.

ggplot(data, aes(x = log(350 - months))) +
  geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(log(350-data$months)), sd = sd(log(350-data$months))), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released : log(350 - y)") + ylab("Density") +
  ggtitle(paste("Month Released : log(350 - y)", "Distribution")) +
  theme_bw()

#produce normal qqplot to assess normality
qqnorm(log(350-data$months), pch = 1, frame = FALSE)
qqline(log(350-data$months), col = "red", lwd = 2)

#Skewness score
skewness(log(350-data$months))
## [1] -0.8029814

The distribution looks more normal but still shows noticeable skewness. The transformation has therefore not achieved normality, as the observations deviate too much from the normal QQ line.

Since issues of skewness remain, we tweak the standard negative-skew transformation to this alternative: \(log(360 - y)\)

summary of the data under this transformation

summary(log(360-data$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.398   3.761   4.317   4.269   4.779   5.878

Notice that the maximum is very similar to before, but the minimum has increased with this change, and the median now sits at a more central location.

ggplot(data, aes(x = log(360 - months))) +
  geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(log(360-data$months)), sd = sd(log(360-data$months))), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released : log(360 - y)") + ylab("Density") +
  ggtitle(paste("Month Released : log(360 - y)", "Distribution")) +
  theme_bw()

#produce normal qqplot to assess normality
qqnorm(log(360-data$months), pch = 1, frame = FALSE)
qqline(log(360-data$months), col = "red", lwd = 2)

#calculate skewness
skewness(log(360-data$months))
## [1] -0.2249134

The distribution of months has greatly improved and looks roughly normal. Furthermore, the normal QQ plot shows a much better fit.

Square root transform for moderately negatively skewed data

Another common transformation for moderately skewed data is the square root approach: \(\sqrt{max(y) + 1 - y} = \sqrt{350 - y}\)

summary of the data under this transformation

summary(sqrt(350-data$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.745   8.062   8.382  10.440  18.628

ggplot(data, aes(x = sqrt(350 - months))) +
  geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(sqrt(350-data$months)), sd = sd(sqrt(350-data$months))), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released : sqrt(350 - y)") + ylab("Density") +
  ggtitle(paste("Month Released : sqrt(350 - y)", "Distribution")) +
  theme_bw()

#produce normal qqplot to assess normality
qqnorm(sqrt(350-data$months), pch = 1, frame = FALSE)
qqline(sqrt(350-data$months), col = "red", lwd = 2)

#calculate skewness
skewness(sqrt(350-data$months))
## [1] 0.433938

Still somewhat positively skewed, but a clear improvement.

Inverse transform for severely negatively skewed data

Lastly, a common transformation for severely negatively skewed data takes an inverse approach: \(\frac{1}{max(y) + 1 - y} = \frac{1}{350 - y}\)

ggplot(data, aes(x = 1/(350 - months))) +
  geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(1/(350-data$months)), sd = sd(1/(350-data$months))), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released : 1/(350 - y)") + ylab("Density") +
  ggtitle(paste("Month Released : 1/(350 - y)", "Distribution")) +
  theme_bw()

The inverse transform drastically overcorrects; the result is nowhere near normal, so we discard it.

Splitting into training and test data.

Create training (75%) and test (25%) sets to fit and evaluate the models.

kpop <- dplyr::select(data, months, popularity, duration, acousticness, danceability, energy, instrumentalness, key, loudness, mode, speechiness, tempo, valence)
kpop0 <- kpop %>% dplyr::filter(mode == 0) %>% dplyr::select(-mode)
kpop1 <- kpop %>% dplyr::filter(mode == 1) %>% dplyr::select(-mode)

### Kpop mode 0 train and test
smpl.size0 <- floor(0.75*nrow(kpop0))
set.seed(123)
smpl0 <- sample(nrow(kpop0), smpl.size0, replace = FALSE)
train0 <- kpop0[smpl0,]
test0 <- kpop0[-smpl0,]

### Kpop mode 1 train and test
smpl.size1 <- floor(0.75*nrow(kpop1))
set.seed(123)
smpl1 <- sample(nrow(kpop1), smpl.size1, replace = FALSE)
train1 <- kpop1[smpl1,]
test1 <- kpop1[-smpl1,]

Assessing performance of the transformations on Multiple linear regression for mode 0 songs

ml0 <- lm(months ~. , data = train0)
summary(ml0)
## 
## Call:
## lm(formula = months ~ ., data = train0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -220.905  -29.308    6.643   37.204  199.560 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      506.91441   16.88344  30.024  < 2e-16 ***
## popularity         1.60481    0.05435  29.528  < 2e-16 ***
## duration          -0.33770    0.03029 -11.151  < 2e-16 ***
## acousticness      31.41991    6.58842   4.769 1.93e-06 ***
## danceability     -61.87134   10.53878  -5.871 4.74e-09 ***
## energy           -98.67824   10.85541  -9.090  < 2e-16 ***
## instrumentalness  69.35253   14.06277   4.932 8.54e-07 ***
## key1              -3.38890    4.81002  -0.705 0.481137    
## key2              -8.25574    5.95496  -1.386 0.165724    
## key3              -4.68988    6.66026  -0.704 0.481381    
## key4             -16.64198    4.72042  -3.526 0.000428 ***
## key5              -6.37892    4.57708  -1.394 0.163508    
## key6              -7.69000    4.77927  -1.609 0.107700    
## key7              -3.91007    5.23715  -0.747 0.455354    
## key8              -1.77216    5.65747  -0.313 0.754115    
## key9             -16.31280    4.85082  -3.363 0.000780 ***
## key10            -13.05634    4.99295  -2.615 0.008962 ** 
## key11            -13.43026    4.45880  -3.012 0.002613 ** 
## loudness          14.12750    0.67833  20.827  < 2e-16 ***
## speechiness       39.57966   15.28264   2.590 0.009642 ** 
## tempo             -0.12300    0.04024  -3.057 0.002255 ** 
## valence          -25.15503    5.77332  -4.357 1.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.61 on 3482 degrees of freedom
## Multiple R-squared:  0.4143, Adjusted R-squared:  0.4108 
## F-statistic: 117.3 on 21 and 3482 DF,  p-value: < 2.2e-16

check assumptions with diagnostic plots

#par(mfrow = c(4,1))
plot(ml0)

Scale-Location: due to the clear decreasing trend rather than a flat horizontal line of equally spread points, there is clear evidence that homogeneity of the residual variance is violated. We may need to run the model on a transformation of the outcome variable, the month of song release.

qqnorm(ml0$residuals)
qqline(ml0$residuals, col = "red")

checking for multicollinearity

car::vif(ml0)
##                      GVIF Df GVIF^(1/(2*Df))
## popularity       1.166795  1        1.080183
## duration         1.095607  1        1.046713
## acousticness     1.572984  1        1.254187
## danceability     1.463926  1        1.209928
## energy           2.625882  1        1.620457
## instrumentalness 1.062607  1        1.030828
## key              1.085836 11        1.003750
## loudness         1.974944  1        1.405327
## speechiness      1.078396  1        1.038458
## tempo            1.112227  1        1.054622
## valence          1.540583  1        1.241202

Overall, the current model does not meet the assumptions of the linear model; we need to transform the response variable to achieve normality of the residuals.

yhat.ml0 = predict(ml0, newdata = test0)
data.frame(
  RMSE = RMSE(yhat.ml0, test0$months),
  R2 = R2(yhat.ml0, test0$months)
)
##       RMSE        R2
## 1 56.76951 0.3780801

MLR: Box Cox Transformation

boxcox(ml0,lambda = seq(1.7, 3, 0.1), plotit = TRUE)

The optimal transformation uses \(\lambda = 2.3\).
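
`bc.transform()` is defined in thesis_functions.R and not shown here; assuming it implements the standard Box-Cox form \((y^\lambda - 1)/\lambda\), a minimal sketch would be:

```r
# Assumed definition of bc.transform() (standard Box-Cox form);
# the actual implementation lives in thesis_functions.R.
bc.transform <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
```

Under this assumption, `bc.transform(349, 2.3)` reproduces the maximum of the transformed summary below (about 306,740), which supports the standard form.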

summary(bc.transform(data$months, 2.3))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      5.01 130891.04 192492.74 180682.07 245871.22 306739.73
hist(bc.transform(data$months, 2.3))

ml0.bc <- lm(bc.transform(months, 2.3) ~. , data = train0)
summary(ml0.bc)
## 
## Call:
## lm(formula = bc.transform(months, 2.3) ~ ., data = train0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -182515  -45410    3662   45295  209577 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      387391.90   19370.62  19.999  < 2e-16 ***
## popularity         2137.24      62.36  34.275  < 2e-16 ***
## duration           -375.40      34.75 -10.804  < 2e-16 ***
## acousticness      33787.57    7558.99   4.470 8.08e-06 ***
## danceability     -61484.25   12091.30  -5.085 3.87e-07 ***
## energy           -72490.65   12454.57  -5.820 6.40e-09 ***
## instrumentalness  79343.44   16134.42   4.918 9.16e-07 ***
## key1              -1387.89    5518.61  -0.251  0.80145    
## key2              -5870.47    6832.21  -0.859  0.39027    
## key3              -6175.19    7641.41  -0.808  0.41908    
## key4             -15890.01    5415.81  -2.934  0.00337 ** 
## key5              -4046.15    5251.36  -0.770  0.44106    
## key6              -6788.43    5483.33  -1.238  0.21580    
## key7                670.53    6008.66   0.112  0.91115    
## key8                661.37    6490.90   0.102  0.91885    
## key9             -15530.42    5565.41  -2.791  0.00529 ** 
## key10            -15579.34    5728.48  -2.720  0.00657 ** 
## key11            -12658.78    5115.65  -2.475  0.01339 *  
## loudness          11349.58     778.26  14.583  < 2e-16 ***
## speechiness       37433.06   17534.00   2.135  0.03284 *  
## tempo              -128.82      46.17  -2.790  0.00530 ** 
## valence          -33700.33    6623.82  -5.088 3.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63800 on 3482 degrees of freedom
## Multiple R-squared:  0.4159, Adjusted R-squared:  0.4124 
## F-statistic: 118.1 on 21 and 3482 DF,  p-value: < 2.2e-16

check diagnostic plots

plot(ml0.bc)

Normality of the residuals has improved, but the other diagnostics have not.

yhat.ml0.bc = predict(ml0.bc, newdata = test0) #mutating the response in newdata is unnecessary for predict()
data.frame(
  RMSE = RMSE(yhat.ml0.bc, bc.transform(test0$months, 2.3)),
  R2 = R2(yhat.ml0.bc, bc.transform(test0$months, 2.3))
)
##       RMSE        R2
## 1 64627.64 0.3899321

MLR: log(360 - y) Transformation

hist(log(360-train0$months))

summary(log(360-train0$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.398   3.732   4.304   4.261   4.796   5.878
ml0.log360 <- lm(log(360-train0$months) ~. , data = train0)
summary(ml0.log360)
## 
## Call:
## lm(formula = log(360 - train0$months) ~ ., data = train0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39175 -0.40146  0.04915  0.44693  1.61319 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.7389301  0.1848139  14.820  < 2e-16 ***
## popularity       -0.0215638  0.0005949 -36.246  < 2e-16 ***
## duration          0.0032350  0.0003315   9.758  < 2e-16 ***
## acousticness     -0.2401500  0.0721199  -3.330 0.000878 ***
## danceability      0.4237910  0.1153624   3.674 0.000243 ***
## energy            0.5720095  0.1188284   4.814 1.54e-06 ***
## instrumentalness -0.8481059  0.1539376  -5.509 3.86e-08 ***
## key1             -0.0127793  0.0526528  -0.243 0.808245    
## key2              0.0157264  0.0651857   0.241 0.809372    
## key3              0.0354738  0.0729063   0.487 0.626596    
## key4              0.1181835  0.0516720   2.287 0.022245 *  
## key5              0.0170423  0.0501029   0.340 0.733768    
## key6              0.0505054  0.0523161   0.965 0.334417    
## key7             -0.0435365  0.0573283  -0.759 0.447650    
## key8             -0.0555081  0.0619293  -0.896 0.370147    
## key9              0.1087090  0.0530993   2.047 0.040706 *  
## key10             0.1264746  0.0546551   2.314 0.020723 *  
## key11             0.1012854  0.0488081   2.075 0.038044 *  
## loudness         -0.0871796  0.0074253 -11.741  < 2e-16 ***
## speechiness      -0.3296733  0.1672909  -1.971 0.048842 *  
## tempo             0.0009236  0.0004405   2.097 0.036072 *  
## valence           0.3243115  0.0631975   5.132 3.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6087 on 3482 degrees of freedom
## Multiple R-squared:  0.4105, Adjusted R-squared:  0.4069 
## F-statistic: 115.4 on 21 and 3482 DF,  p-value: < 2.2e-16

check diagnostic plots

plot(ml0.log360)

Predictions and performance diagnostics

yhat.ml0.log360 = predict(ml0.log360, newdata = test0)
data.frame(
  RMSE = RMSE(yhat.ml0.log360, log(360 - test0$months)),
  R2 = R2(yhat.ml0.log360, log(360 - test0$months))
)
##        RMSE        R2
## 1 0.5976752 0.3932124

MLR: sqrt(350 - y) Transformation

hist(sqrt(350-train0$months))

summary(sqrt(350-train0$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.635   8.000   8.377  10.536  18.628
ml0.sqrt350 <- lm(sqrt(350-train0$months) ~. , data = train0)
summary(ml0.sqrt350)
## 
## Call:
## lm(formula = sqrt(350 - train0$months) ~ ., data = train0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4296  -2.0157  -0.0706   2.0165   9.4299 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.352232   0.877636  -1.541  0.12346    
## popularity       -0.097745   0.002825 -34.598  < 2e-16 ***
## duration          0.016863   0.001574  10.711  < 2e-16 ***
## acousticness     -1.411834   0.342480  -4.122 3.84e-05 ***
## danceability      2.630235   0.547828   4.801 1.64e-06 ***
## energy            3.783006   0.564287   6.704 2.36e-11 ***
## instrumentalness -3.941710   0.731012  -5.392 7.43e-08 ***
## key1              0.040263   0.250035   0.161  0.87208    
## key2              0.235396   0.309551   0.760  0.44704    
## key3              0.212147   0.346214   0.613  0.54007    
## key4              0.712502   0.245377   2.904  0.00371 ** 
## key5              0.187106   0.237926   0.786  0.43169    
## key6              0.315464   0.248436   1.270  0.20424    
## key7             -0.036510   0.272238  -0.134  0.89332    
## key8             -0.112283   0.294087  -0.382  0.70263    
## key9              0.678401   0.252156   2.690  0.00717 ** 
## key10             0.660914   0.259544   2.546  0.01093 *  
## key11             0.589000   0.231778   2.541  0.01109 *  
## loudness         -0.560716   0.035261 -15.902  < 2e-16 ***
## speechiness      -1.819528   0.794423  -2.290  0.02206 *  
## tempo             0.005484   0.002092   2.622  0.00879 ** 
## valence           1.494620   0.300109   4.980 6.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.89 on 3482 degrees of freedom
## Multiple R-squared:  0.4249, Adjusted R-squared:  0.4214 
## F-statistic: 122.5 on 21 and 3482 DF,  p-value: < 2.2e-16

check diagnostic plots

plot(ml0.sqrt350)

Predictions and performance diagnostics

yhat.ml0.sqrt350 = predict(ml0.sqrt350, newdata = test0)
data.frame(
  RMSE = RMSE(yhat.ml0.sqrt350, sqrt(350-test0$months)),
  R2 = R2(yhat.ml0.sqrt350, sqrt(350-test0$months))
)
##       RMSE        R2
## 1 2.894936 0.3988485

Applying the log(360 - y) transform to all data

train0$months <- log(360 - train0$months)
test0$months <- log(360 - test0$months)

train1$months <- log(360 - train1$months)
test1$months <- log(360 - test1$months)

Multiple linear regression for mode 0 songs

Full Multiple Linear Regression

set.seed(123)
mlr0 <- lm(months ~. , data = train0)

assess model assumptions

plot(mlr0)

assessing multicollinearity

vif(mlr0)
##                      GVIF Df GVIF^(1/(2*Df))
## popularity       1.166795  1        1.080183
## duration         1.095607  1        1.046713
## acousticness     1.572984  1        1.254187
## danceability     1.463926  1        1.209928
## energy           2.625882  1        1.620457
## instrumentalness 1.062607  1        1.030828
## key              1.085836 11        1.003750
## loudness         1.974944  1        1.405327
## speechiness      1.078396  1        1.038458
## tempo            1.112227  1        1.054622
## valence          1.540583  1        1.241202

Since none of these GVIF values exceed 4, we conclude there are no issues of multicollinearity.
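
As a check on the scale: for terms with multiple degrees of freedom, `car::vif()` reports GVIF^(1/(2·Df)), which is on the coefficient standard-error scale; squaring it recovers the usual VIF-scale value. For energy:

```r
# Squaring GVIF^(1/(2*Df)) (here Df = 1 for energy) returns it to the
# usual VIF scale and recovers the GVIF column above.
1.620457^2
## [1] 2.625881
```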

view model

summary(mlr0)
## 
## Call:
## lm(formula = months ~ ., data = train0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39175 -0.40146  0.04915  0.44693  1.61319 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.7389301  0.1848139  14.820  < 2e-16 ***
## popularity       -0.0215638  0.0005949 -36.246  < 2e-16 ***
## duration          0.0032350  0.0003315   9.758  < 2e-16 ***
## acousticness     -0.2401500  0.0721199  -3.330 0.000878 ***
## danceability      0.4237910  0.1153624   3.674 0.000243 ***
## energy            0.5720095  0.1188284   4.814 1.54e-06 ***
## instrumentalness -0.8481059  0.1539376  -5.509 3.86e-08 ***
## key1             -0.0127793  0.0526528  -0.243 0.808245    
## key2              0.0157264  0.0651857   0.241 0.809372    
## key3              0.0354738  0.0729063   0.487 0.626596    
## key4              0.1181835  0.0516720   2.287 0.022245 *  
## key5              0.0170423  0.0501029   0.340 0.733768    
## key6              0.0505054  0.0523161   0.965 0.334417    
## key7             -0.0435365  0.0573283  -0.759 0.447650    
## key8             -0.0555081  0.0619293  -0.896 0.370147    
## key9              0.1087090  0.0530993   2.047 0.040706 *  
## key10             0.1264746  0.0546551   2.314 0.020723 *  
## key11             0.1012854  0.0488081   2.075 0.038044 *  
## loudness         -0.0871796  0.0074253 -11.741  < 2e-16 ***
## speechiness      -0.3296733  0.1672909  -1.971 0.048842 *  
## tempo             0.0009236  0.0004405   2.097 0.036072 *  
## valence           0.3243115  0.0631975   5.132 3.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6087 on 3482 degrees of freedom
## Multiple R-squared:  0.4105, Adjusted R-squared:  0.4069 
## F-statistic: 115.4 on 21 and 3482 DF,  p-value: < 2.2e-16

All variables except key1, key2, key3, key5, key6, key7, and key8 have a significant linear relationship with the month of a song's release. With an adjusted R-squared of 0.4069, the model explains 40.69% of the variation in months.

The Full Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.739 - 0.022{popularity} + 0.003{duration} - 0.240{acousticness} + 0.424{danceability} + 0.572{energy} \\ - 0.848{instrumentalness} - 0.013{key1} + 0.016{key2} + 0.035{key3} + 0.118{key4} + 0.017{key5} \\ + 0.051{key6} - 0.044{key7} - 0.056{key8} + 0.109{key9} + 0.126{key10} + 0.101{key11} \\ - 0.087{loudness} - 0.330{speechiness} + 0.001{tempo} + 0.324{valence} \]

# prediction on test data
yhat.mlr0 = predict(mlr0, newdata=test0)
# RMSE for test data
error.mlr0 <- yhat.mlr0 - test0$months
rmse.mlr0 <- sqrt(mean(error.mlr0^2))
data.frame(
  RMSE = RMSE(yhat.mlr0, test0$months),
  R2 = R2(yhat.mlr0, test0$months)
)
##        RMSE        R2
## 1 0.5976752 0.3932124

Variable Selection: Stepwise 10-fold Cross-Validation

set.seed(123)
train_control <- trainControl(method = "cv", 
                              number = 10) 
bye <- capture.output(mlr.step_kcv0 <- train(months ~. , data = train0,  
                 method = "lmStepAIC", trControl = train_control)) 
print(mlr.step_kcv0)
## Linear Regression with Stepwise Selection 
## 
## 3504 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3153, 3153, 3153, 3154, 3155, 3153, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.6115382  0.4023572  0.4944596
summary(mlr.step_kcv0$finalModel)
## 
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key4 + key6 + 
##     key9 + key10 + key11 + loudness + speechiness + tempo + valence, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.38943 -0.40057  0.04912  0.44462  1.61241 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.7327797  0.1802230  15.163  < 2e-16 ***
## popularity       -0.0215505  0.0005940 -36.280  < 2e-16 ***
## duration          0.0032225  0.0003308   9.741  < 2e-16 ***
## acousticness     -0.2409479  0.0719685  -3.348 0.000823 ***
## danceability      0.4254183  0.1152193   3.692 0.000226 ***
## energy            0.5722443  0.1185570   4.827 1.45e-06 ***
## instrumentalness -0.8523074  0.1537756  -5.543 3.20e-08 ***
## key4              0.1245219  0.0351732   3.540 0.000405 ***
## key6              0.0567355  0.0359754   1.577 0.114871    
## key9              0.1150826  0.0371718   3.096 0.001977 ** 
## key10             0.1327266  0.0392897   3.378 0.000738 ***
## key11             0.1074954  0.0307223   3.499 0.000473 ***
## loudness         -0.0870517  0.0074155 -11.739  < 2e-16 ***
## speechiness      -0.3278520  0.1670495  -1.963 0.049772 *  
## tempo             0.0009343  0.0004402   2.122 0.033869 *  
## valence           0.3244776  0.0629704   5.153 2.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6085 on 3488 degrees of freedom
## Multiple R-squared:  0.4099, Adjusted R-squared:  0.4074 
## F-statistic: 161.5 on 15 and 3488 DF,  p-value: < 2.2e-16

The Stepwise Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths.

\[ log(360 - \hat{months}) = 2.732 - 0.022{popularity} + 0.003{duration} - 0.241{acousticness} + 0.425{danceability} + 0.572{energy} \\ - 0.852{instrumentalness} + 0.125{key4} + 0.057{key6} + 0.115{key9} + 0.133{key10} + 0.107{key11 } \\ - 0.087{loudness} - 0.328{speechiness} + 0.001{tempo} + 0.324{valence} \]

Compared to the full multiple linear regression model, the stepwise regression model omits key1, key2, key3, key5, key7, and key8. All remaining variables except key6 have a significant linear relationship with the response. The share of variation explained, 40.74%, is just 0.05 percentage points higher than the full model's 40.69%.

prediction on test data
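
`key.dummy()` is defined in thesis_functions.R and not shown here; it is assumed to expand the 12-level `key` factor into the key1–key11 indicator columns that the stepwise final model expects. A minimal sketch under that assumption:

```r
# Assumed behaviour of key.dummy() from thesis_functions.R:
# replace the `key` factor with 0/1 indicators key1..key11
# (key0 serves as the reference level); other columns are kept.
key.dummy <- function(df) {
  for (k in 1:11) df[[paste0("key", k)]] <- as.integer(df$key == k)
  df$key <- NULL
  df
}
```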

# prediction on test data
yhat.step_kcv0 = predict(mlr.step_kcv0$finalModel, newdata=key.dummy(test0))
# RMSE for test data
error.step_kcv0 <- yhat.step_kcv0 - test0$months
rmse.step_kcv0 <- sqrt(mean(error.step_kcv0^2))
data.frame(
  RMSE = RMSE(yhat.step_kcv0, test0$months),
  R2 = R2(yhat.step_kcv0, test0$months)
)
##        RMSE        R2
## 1 0.5980156 0.3925419

Variable Selection: All Subsets Regression (Not using K-folds CV)

set.seed(123)
allreg.models0 <- regsubsets(months ~., data = train0, nvmax = 21)
summary(allreg.models0)
## Subset selection object
## Call: regsubsets.formula(months ~ ., data = train0, nvmax = 21)
## 21 Variables  (and intercept)
##                  Forced in Forced out
## popularity           FALSE      FALSE
## duration             FALSE      FALSE
## acousticness         FALSE      FALSE
## danceability         FALSE      FALSE
## energy               FALSE      FALSE
## instrumentalness     FALSE      FALSE
## key1                 FALSE      FALSE
## key2                 FALSE      FALSE
## key3                 FALSE      FALSE
## key4                 FALSE      FALSE
## key5                 FALSE      FALSE
## key6                 FALSE      FALSE
## key7                 FALSE      FALSE
## key8                 FALSE      FALSE
## key9                 FALSE      FALSE
## key10                FALSE      FALSE
## key11                FALSE      FALSE
## loudness             FALSE      FALSE
## speechiness          FALSE      FALSE
## tempo                FALSE      FALSE
## valence              FALSE      FALSE
## 1 subsets of each size up to 21
## Selection Algorithm: exhaustive
##           popularity duration acousticness danceability energy instrumentalness
## 1  ( 1 )  "*"        " "      " "          " "          " "    " "             
## 2  ( 1 )  "*"        "*"      " "          " "          " "    " "             
## 3  ( 1 )  "*"        "*"      " "          "*"          " "    " "             
## 4  ( 1 )  "*"        "*"      " "          " "          " "    " "             
## 5  ( 1 )  "*"        "*"      " "          " "          "*"    " "             
## 6  ( 1 )  "*"        "*"      " "          " "          "*"    "*"             
## 7  ( 1 )  "*"        "*"      " "          "*"          "*"    "*"             
## 8  ( 1 )  "*"        "*"      "*"          "*"          "*"    "*"             
## 9  ( 1 )  "*"        "*"      "*"          "*"          "*"    "*"             
## 10  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 11  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 12  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 13  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 14  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 15  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 16  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 17  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 18  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 19  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 20  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 21  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
##           key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 loudness
## 1  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 2  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 3  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 4  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 5  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 6  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 7  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 8  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 9  ( 1 )  " "  " "  " "  " "  " "  " "  " "  "*"  " "  " "   " "   "*"     
## 10  ( 1 ) " "  " "  " "  " "  " "  " "  "*"  "*"  " "  " "   " "   "*"     
## 11  ( 1 ) " "  " "  " "  "*"  " "  " "  " "  " "  " "  "*"   "*"   "*"     
## 12  ( 1 ) " "  " "  " "  "*"  " "  " "  " "  " "  "*"  "*"   "*"   "*"     
## 13  ( 1 ) " "  " "  " "  "*"  " "  " "  " "  " "  "*"  "*"   "*"   "*"     
## 14  ( 1 ) " "  " "  " "  "*"  " "  " "  " "  " "  "*"  "*"   "*"   "*"     
## 15  ( 1 ) " "  " "  " "  "*"  " "  "*"  " "  " "  "*"  "*"   "*"   "*"     
## 16  ( 1 ) " "  " "  " "  "*"  " "  "*"  " "  "*"  "*"  "*"   "*"   "*"     
## 17  ( 1 ) " "  " "  " "  "*"  " "  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
## 18  ( 1 ) "*"  " "  " "  "*"  " "  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
## 19  ( 1 ) "*"  " "  "*"  "*"  " "  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
## 20  ( 1 ) "*"  " "  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
## 21  ( 1 ) "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
##           speechiness tempo valence
## 1  ( 1 )  " "         " "   " "    
## 2  ( 1 )  " "         " "   " "    
## 3  ( 1 )  " "         " "   " "    
## 4  ( 1 )  " "         " "   "*"    
## 5  ( 1 )  " "         " "   "*"    
## 6  ( 1 )  " "         " "   "*"    
## 7  ( 1 )  " "         " "   "*"    
## 8  ( 1 )  " "         " "   "*"    
## 9  ( 1 )  " "         " "   "*"    
## 10  ( 1 ) " "         " "   "*"    
## 11  ( 1 ) " "         " "   "*"    
## 12  ( 1 ) " "         " "   "*"    
## 13  ( 1 ) " "         "*"   "*"    
## 14  ( 1 ) "*"         "*"   "*"    
## 15  ( 1 ) "*"         "*"   "*"    
## 16  ( 1 ) "*"         "*"   "*"    
## 17  ( 1 ) "*"         "*"   "*"    
## 18  ( 1 ) "*"         "*"   "*"    
## 19  ( 1 ) "*"         "*"   "*"    
## 20  ( 1 ) "*"         "*"   "*"    
## 21  ( 1 ) "*"         "*"   "*"

Assess models

allreg.res0 <- summary(allreg.models0)
allreg.compare0 <- data.frame(model = c(1:21),
                              Adj.R2 = allreg.res0$adjr2,
                              CP = allreg.res0$cp)
allreg.compare0
##    model    Adj.R2        CP
## 1      1 0.3462533 360.15498
## 2      2 0.3591335 285.02115
## 3      3 0.3707186 217.57361
## 4      4 0.3838071 141.29567
## 5      5 0.3921850  92.84423
## 6      6 0.3971156  64.74779
## 7      7 0.4005655  45.39542
## 8      8 0.4025885  34.46326
## 9      9 0.4033295  31.09094
## 10    10 0.4041639  27.17060
## 11    11 0.4050357  23.03289
## 12    12 0.4061457  17.49622
## 13    13 0.4066369  15.60448
## 14    14 0.4071063  13.84283
## 15    15 0.4073589  13.35758
## 16    16 0.4074023  14.10287
## 17    17 0.4074617  14.75469
## 18    18 0.4073768  16.25484
## 19    19 0.4072293  18.12198
## 20    20 0.4070700  20.05820
## 21    21 0.4069096  22.00000

Model 15 has the smallest CP value (13.36), and its Adjusted R2 (0.40736) is within 0.0001 of the maximum, so model 15 is selected for fitting.
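The pick above was read off the table by eye; a minimal sketch of making it programmatic, with synthetic CP and Adjusted R2 values standing in for `allreg.compare0`:

```r
# Toy stand-in for allreg.compare0: pick the subset size with the
# smallest Mallows' CP and the largest adjusted R^2.
cmp <- data.frame(model  = 1:5,
                  Adj.R2 = c(0.4006, 0.4050, 0.4074, 0.4073, 0.4069),
                  CP     = c(45.40, 23.03, 13.36, 14.10, 22.00))
best.cp  <- cmp$model[which.min(cmp$CP)]      # smallest CP -> model 3
best.ar2 <- cmp$model[which.max(cmp$Adj.R2)]  # largest adj. R^2 -> model 3
```

When the two criteria disagree, as they do slightly in the real table, the tie has to be broken by judgment.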

set.seed(123)
mlr.allreg0 <- lm(months ~ popularity +duration +acousticness +danceability +energy +instrumentalness+
  key4+key6+key9 +key10 +key11 +loudness +speechiness+ tempo +valence, data = key.dummy(train0))
summary(mlr.allreg0)
## 
## Call:
## lm(formula = months ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key4 + key6 + 
##     key9 + key10 + key11 + loudness + speechiness + tempo + valence, 
##     data = key.dummy(train0))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.38943 -0.40057  0.04912  0.44462  1.61241 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.7327797  0.1802230  15.163  < 2e-16 ***
## popularity       -0.0215505  0.0005940 -36.280  < 2e-16 ***
## duration          0.0032225  0.0003308   9.741  < 2e-16 ***
## acousticness     -0.2409479  0.0719685  -3.348 0.000823 ***
## danceability      0.4254183  0.1152193   3.692 0.000226 ***
## energy            0.5722443  0.1185570   4.827 1.45e-06 ***
## instrumentalness -0.8523074  0.1537756  -5.543 3.20e-08 ***
## key4              0.1245219  0.0351732   3.540 0.000405 ***
## key6              0.0567355  0.0359754   1.577 0.114871    
## key9              0.1150826  0.0371718   3.096 0.001977 ** 
## key10             0.1327266  0.0392897   3.378 0.000738 ***
## key11             0.1074954  0.0307223   3.499 0.000473 ***
## loudness         -0.0870517  0.0074155 -11.739  < 2e-16 ***
## speechiness      -0.3278520  0.1670495  -1.963 0.049772 *  
## tempo             0.0009343  0.0004402   2.122 0.033869 *  
## valence           0.3244776  0.0629704   5.153 2.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6085 on 3488 degrees of freedom
## Multiple R-squared:  0.4099, Adjusted R-squared:  0.4074 
## F-statistic: 161.5 on 15 and 3488 DF,  p-value: < 2.2e-16

The All Subsets Regression Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths.

\[ log(360 - \hat{months}) = 2.733 - 0.022{popularity} + 0.003{duration} - 0.241{acousticness} + 0.425{danceability} + 0.572{energy} \\ - 0.852{instrumentalness} + 0.125{key4} + 0.057{key6}+ 0.115{key9} + 0.133{key10} + 0.107{key11} \\ - 0.087{loudness} - 0.328{speechiness} + 0.001{tempo} + 0.324{valence} \]

The multiple linear regression model resulting from all subsets regression is identical to the one from the stepwise method: it selects the same variables, flags the same ones as significant, and has the same adjusted R squared value.

prediction on test data

# prediction on test data
yhat.allreg0 = predict(mlr.allreg0, newdata=key.dummy(test0))
# RMSE for test data
# error.allreg0 <- yhat.allreg0 - test0$months
# rmse.allreg0 <- sqrt(mean(error.allreg0^2))
data.frame(
  RMSE = RMSE(yhat.allreg0, test0$months),
  R2 = R2(yhat.allreg0, test0$months)
)
##        RMSE        R2
## 1 0.5980156 0.3925419
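For reference, the two test-set metrics reported throughout reduce to the following (a sketch of caret's `RMSE()` and `R2()`; caret's `R2()` defaults to the squared Pearson correlation between predictions and observations):

```r
# Hand-rolled equivalents of the two test-set metrics.
rmse.fn <- function(pred, obs) sqrt(mean((pred - obs)^2))
r2.fn   <- function(pred, obs) cor(pred, obs)^2

obs  <- c(1, 2, 3, 4, 5)           # toy observed values
pred <- c(1.1, 1.9, 3.2, 3.8, 5.1) # toy predictions
rmse.fn(pred, obs)
r2.fn(pred, obs)
```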

Regularized Regression: Ridge 10 Fold Cross Validation

set.seed(123)
lm.ridge0 <- train(months ~. , data = train0, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.ridge0$bestTune
##    alpha lambda
## 45     0  0.045
# best coefficient
lm.ridge0.model <- coef(lm.ridge0$finalModel, lm.ridge0$bestTune$lambda)
lm.ridge0.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       2.8657735959
## popularity       -0.0205267725
## duration          0.0031661947
## acousticness     -0.2445830342
## danceability      0.4128163847
## energy            0.4823365034
## instrumentalness -0.7491778746
## key1             -0.0342147260
## key2             -0.0056979626
## key3              0.0075189682
## key4              0.0953873860
## key5             -0.0072177244
## key6              0.0225877475
## key7             -0.0627117017
## key8             -0.0749979230
## key9              0.0823586421
## key10             0.0986420788
## key11             0.0783989653
## loudness         -0.0798794742
## speechiness      -0.3084093591
## tempo             0.0008529497
## valence           0.3234950755

The Ridge Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths.

\[ log(360 - \hat{months}) = 2.866 - 0.021{popularity} + 0.003{duration} - 0.245{acousticness} + 0.413{danceability} + 0.482{energy} \\ - 0.749{instrumentalness} - 0.034{key1} - 0.006{key2} + 0.008{key3} + 0.095{key4} - 0.007{key5} \\ + 0.023{key6} - 0.063{key7} - 0.075{key8} + 0.082{key9} + 0.099{key10} + 0.078{key11} \\ - 0.080{loudness} - 0.308{speechiness} + 0.001{tempo} + 0.323{valence} \]

# prediction on test data
yhat.ridge0 = predict(lm.ridge0, s=lm.ridge0$bestTune,newdata=test0)
# RMSE for test data
data.frame(
  RMSE = RMSE(yhat.ridge0, test0$months),
  R2 = R2(yhat.ridge0, test0$months)
)
##        RMSE        R2
## 1 0.5981751 0.3921513

Regularized Regression: Lasso 10 Fold Cross Validation

set.seed(123)
lm.lasso0 <- train(months ~. , data = train0, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.lasso0$bestTune
##   alpha lambda
## 2     1  0.002
# best coefficient
lm.lasso0.model <- coef(lm.lasso0$finalModel, lm.lasso0$bestTune$lambda)
lm.lasso0.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       2.8365852962
## popularity       -0.0215614447
## duration          0.0031812002
## acousticness     -0.2395615122
## danceability      0.4088128971
## energy            0.5333749094
## instrumentalness -0.8078442972
## key1             -0.0270821553
## key2              .           
## key3              0.0040833394
## key4              0.0923273187
## key5              .           
## key6              0.0238454868
## key7             -0.0555632703
## key8             -0.0676519346
## key9              0.0819159291
## key10             0.0992058126
## key11             0.0770663223
## loudness         -0.0839496407
## speechiness      -0.2803564060
## tempo             0.0008109799
## valence           0.3212503863

The Lasso Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths.

\[ log(360 - \hat{months}) = 2.837 - 0.022{popularity} + 0.003{duration} - 0.240{acousticness} + 0.409{danceability} + 0.533{energy} \\ - 0.808{instrumentalness} - 0.027{key1} + 0.004{key3} + 0.092{key4} + 0.024{key6} \\ - 0.056{key7} - 0.068{key8} + 0.082{key9} + 0.099{key10} + 0.077{key11} \\ - 0.084{loudness} - 0.280{speechiness} + 0.001{tempo} + 0.321{valence} \]

Compared to the full model, the dummy variables key2 and key5 are dropped from the Lasso model: their coefficients are shrunk exactly to zero.
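Why lasso can zero out whole variables while ridge cannot: under an orthonormal design, the lasso solution is soft-thresholding of the least-squares estimate. A toy sketch (the key2/key5 values here are illustrative, not the fitted ones):

```r
# Soft-thresholding: coefficients smaller than lambda in absolute
# value are set exactly to zero; the rest are shrunk by lambda.
soft  <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
b.ols <- c(key2 = 0.001, key5 = -0.0015, valence = 0.321)
soft(b.ols, lambda = 0.002)  # key2 and key5 become exactly 0
```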

# prediction on test data
yhat.lasso0 = predict(lm.lasso0, s=lm.lasso0$bestTune,newdata=test0)
# RMSE for test data
error.lasso0 <- yhat.lasso0 - test0$months
rmse.lasso0 <- sqrt(mean(error.lasso0^2))
data.frame(
  RMSE = RMSE(yhat.lasso0, test0$months),
  R2 = R2(yhat.lasso0, test0$months)
)
##        RMSE        R2
## 1 0.5975859 0.3932411

Regularized Regression: Elastic Net 10 Fold Cross Validation

lm.enet0 <- train(months ~. , data = train0, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = seq(0,1,by = 0.05),lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
lm.enet0$bestTune
##     alpha lambda
## 606   0.3  0.006
# best coefficient
lm.enet0.model <- coef(lm.enet0$finalModel, lm.enet0$bestTune$lambda)
lm.enet0.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       2.8384749012
## popularity       -0.0214635712
## duration          0.0031807922
## acousticness     -0.2401868860
## danceability      0.4092942334
## energy            0.5283274964
## instrumentalness -0.8023846853
## key1             -0.0272798667
## key2              .           
## key3              0.0050214938
## key4              0.0932464805
## key5              .           
## key6              0.0242996006
## key7             -0.0557886082
## key8             -0.0679070363
## key9              0.0825644898
## key10             0.0997482545
## key11             0.0777745321
## loudness         -0.0835712833
## speechiness      -0.2834343102
## tempo             0.0008161823
## valence           0.3216293040

The Elastic Net Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths.

\[ log(360 - \hat{months}) = 2.838 - 0.021{popularity} + 0.003{duration} - 0.240{acousticness} + 0.409{danceability} + 0.528{energy} \\ - 0.802{instrumentalness} - 0.027{key1} + 0.005{key3} + 0.093{key4} + 0.024{key6} \\ - 0.056{key7} - 0.068{key8} + 0.083{key9} + 0.100{key10} + 0.078{key11} \\ - 0.084{loudness} - 0.283{speechiness} + 0.001{tempo} + 0.322{valence} \]

# prediction on test data
yhat.enet0 = predict(lm.enet0, s=lm.enet0$bestTune, newdata=test0)
# RMSE for test data
error.enet0 <- yhat.enet0 - test0$months
rmse.enet0 <- sqrt(mean(error.enet0^2))
data.frame(
  RMSE = RMSE(yhat.enet0, test0$months),
  R2 = R2(yhat.enet0, test0$months)
)
##        RMSE        R2
## 1 0.5976055 0.3931638
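For reference, the alpha/lambda grid above tunes glmnet's elastic net objective (intercept unpenalized), which bridges ridge (alpha = 0) and lasso (alpha = 1):

\[ \min_{\beta} \frac{1}{2n}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda\left(\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right) \]

With the selected alpha of 0.3, the penalty is weighted toward the ridge term, which is consistent with only key2 and key5 being zeroed out.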

Deciding on the best model

results0 <- data.frame(Model = c("FullMLR", "Stepwise", "AllSubsets","Ridge", "Lasso", "Elastic Net"),
                       RMSE = round(c(0.5976752 , 0.5980156, 0.5980156,0.5981751,0.5975859, 0.5976055), 5),
                       R2 = round(c(0.3932124, 0.3925419, 0.3925419,0.3921513,0.3932411, 0.3931638), 5))
results0
##         Model    RMSE      R2
## 1     FullMLR 0.59768 0.39321
## 2    Stepwise 0.59802 0.39254
## 3  AllSubsets 0.59802 0.39254
## 4       Ridge 0.59818 0.39215
## 5       Lasso 0.59759 0.39324
## 6 Elastic Net 0.59761 0.39316
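The winner can also be picked programmatically from the comparison table (values reproduced from above):

```r
# Rebuild the comparison table and select by test RMSE and R^2.
results0 <- data.frame(
  Model = c("FullMLR", "Stepwise", "AllSubsets", "Ridge", "Lasso", "Elastic Net"),
  RMSE  = c(0.59768, 0.59802, 0.59802, 0.59818, 0.59759, 0.59761),
  R2    = c(0.39321, 0.39254, 0.39254, 0.39215, 0.39324, 0.39316))
results0$Model[which.min(results0$RMSE)]  # lowest RMSE
results0$Model[which.max(results0$R2)]    # highest R^2
```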

With the lowest RMSE and the highest R2 value on the test-data predictions, Lasso provides the best model for predicting the month a song was released from the Spotify audio features.

\[ log(360 - \hat{months}) = 2.837 - 0.022{popularity} + 0.003{duration} - 0.240{acousticness} \\ + 0.409{danceability} + 0.533{energy} - 0.808{instrumentalness} \\ - 0.027{key1} + 0.004{key3} + 0.092{key4} + 0.024{key6} \\ - 0.056{key7} - 0.068{key8} + 0.082{key9} + 0.099{key10} + 0.077{key11} \\ - 0.084{loudness} - 0.280{speechiness} + 0.001{tempo} + 0.321{valence} \]

Back-transformed to the original scale, the Lasso model is: \[ 360 - \hat{months} = \frac{exp(2.837 + 0.003{duration} + 0.409{danceability} + 0.533{energy} + 0.004{key3} + 0.092{key4} + 0.024{key6} + 0.082{key9} + 0.099{key10} + 0.077{key11} + 0.001{tempo} + 0.321{valence})}{exp(0.022{popularity} + 0.240{acousticness} + 0.808{instrumentalness} + 0.027{key1} + 0.056{key7} + 0.068{key8} + 0.084{loudness} + 0.280{speechiness})} \]
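A quick sketch of applying the back-transformation to fitted values on the log scale (the three values are taken from the transformation check at the top of this analysis):

```r
# Invert log(360 - months): months = 360 - exp(prediction).
yhat.log    <- c(4.762174, 4.787492, 2.708050)  # fitted values on log scale
months.pred <- 360 - exp(yhat.log)
round(months.pred)  # 243 240 345
```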

Multiple linear regression for mode 1 songs

Full Multiple Linear Regression

set.seed(123)
mlr1 <- lm(months ~. , data = train1)

assess model assumptions

plot(mlr1)

view model

summary(mlr1)
## 
## Call:
## lm(formula = months ~ ., data = train1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48440 -0.39951  0.05378  0.45331  1.76459 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.9901418  0.1434374  20.846  < 2e-16 ***
## popularity       -0.0198841  0.0005122 -38.818  < 2e-16 ***
## duration          0.0020853  0.0002589   8.054 9.73e-16 ***
## acousticness     -0.1274822  0.0508609  -2.506  0.01222 *  
## danceability      0.6314340  0.0873034   7.233 5.39e-13 ***
## energy            0.3826541  0.0973081   3.932 8.51e-05 ***
## instrumentalness -0.7706272  0.1212973  -6.353 2.28e-10 ***
## key1              0.0842570  0.0314613   2.678  0.00743 ** 
## key2              0.0287030  0.0336488   0.853  0.39369    
## key3              0.0202750  0.0516321   0.393  0.69457    
## key4              0.0680017  0.0436575   1.558  0.11938    
## key5              0.0749496  0.0383254   1.956  0.05056 .  
## key6              0.0957015  0.0375012   2.552  0.01074 *  
## key7              0.1019746  0.0312840   3.260  0.00112 ** 
## key8              0.1189275  0.0362960   3.277  0.00106 ** 
## key9              0.0124663  0.0389359   0.320  0.74885    
## key10             0.0810947  0.0453378   1.789  0.07372 .  
## key11             0.0463297  0.0403927   1.147  0.25144    
## loudness         -0.0757560  0.0061732 -12.272  < 2e-16 ***
## speechiness      -0.1799930  0.1442184  -1.248  0.21206    
## tempo             0.0007324  0.0003286   2.229  0.02589 *  
## valence           0.2936440  0.0531944   5.520 3.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6245 on 5482 degrees of freedom
## Multiple R-squared:  0.3206, Adjusted R-squared:  0.318 
## F-statistic: 123.2 on 21 and 5482 DF,  p-value: < 2.2e-16

The Full Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.990 - 0.020{popularity} + 0.002{duration} - 0.127{acousticness} + 0.631{danceability} + 0.383{energy} \\ - 0.771{instrumentalness} + 0.084{key1} + 0.029{key2} + 0.020{key3} + 0.068{key4} + 0.075{key5} \\ + 0.096{key6} + 0.102{key7} + 0.119{key8} + 0.012{key9} + 0.081{key10} + 0.046{key11} \\ - 0.076{loudness} - 0.180{speechiness} + 0.001{tempo} + 0.294{valence} \]

The model summary shows that the variables with a significant linear relationship to the month the song was released are: popularity, duration, acousticness, danceability, energy, instrumentalness, key1, key6, key7, key8, loudness, tempo, and valence.
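The list above was read off the coefficient table by hand; a sketch of extracting it programmatically from any fitted `lm` (using a built-in dataset and a toy model rather than train1):

```r
# Pull the predictors with p < 0.05 out of the summary() coefficient table.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # built-in data, toy model
ct  <- summary(fit)$coefficients
sig.vars <- setdiff(rownames(ct)[ct[, "Pr(>|t|)"] < 0.05], "(Intercept)")
sig.vars
```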

# prediction on test data
yhat.mlr1 = predict(mlr1, newdata=test1)
# RMSE for test data
error.mlr1 <- yhat.mlr1 - test1$months
rmse.mlr1 <- sqrt(mean(error.mlr1^2))
data.frame(
  RMSE = RMSE(yhat.mlr1, test1$months),
  R2 = R2(yhat.mlr1, test1$months)
)
##        RMSE        R2
## 1 0.6078257 0.3312429

Variable Selection: Stepwise 10 fold Cross Validation

set.seed(123)
train_control <- trainControl(method = "cv", 
                              number = 10) 
bye <- capture.output(mlr.step_kcv1 <- train(months ~. , data = train1,  
                 method = "lmStepAIC", trControl = train_control)) 
print(mlr.step_kcv1)
## Linear Regression with Stepwise Selection 
## 
## 5504 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4953, 4953, 4955, 4954, 4953, 4954, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.6266852  0.3145441  0.5035447
summary(mlr.step_kcv1$finalModel)
## 
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key1 + key5 + 
##     key6 + key7 + key8 + key10 + loudness + tempo + valence, 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4842 -0.4012  0.0563  0.4534  1.7748 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.0299902  0.1417423  21.377  < 2e-16 ***
## popularity       -0.0199787  0.0005048 -39.580  < 2e-16 ***
## duration          0.0021218  0.0002565   8.272  < 2e-16 ***
## acousticness     -0.1239002  0.0506910  -2.444 0.014548 *  
## danceability      0.6302868  0.0871225   7.234 5.31e-13 ***
## energy            0.3566622  0.0950243   3.753 0.000176 ***
## instrumentalness -0.7653059  0.1209561  -6.327 2.70e-10 ***
## key1              0.0616367  0.0266563   2.312 0.020799 *  
## key5              0.0529274  0.0343880   1.539 0.123832    
## key6              0.0732353  0.0335388   2.184 0.029034 *  
## key7              0.0786565  0.0264008   2.979 0.002902 ** 
## key8              0.0959522  0.0321543   2.984 0.002857 ** 
## key10             0.0597350  0.0419649   1.423 0.154663    
## loudness         -0.0739348  0.0060209 -12.280  < 2e-16 ***
## tempo             0.0006867  0.0003269   2.101 0.035717 *  
## valence           0.2887687  0.0530324   5.445 5.40e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6244 on 5488 degrees of freedom
## Multiple R-squared:   0.32,  Adjusted R-squared:  0.3181 
## F-statistic: 172.1 on 15 and 5488 DF,  p-value: < 2.2e-16

The Multiple Linear Regression model resulting from stepwise 10 fold cross validation for mode 1 songs is written below with coefficients rounded to the thousandths.

\[ log(360 - \hat{months}) = 3.030 - 0.020{popularity} + 0.002{duration} - 0.124{acousticness} + 0.630{danceability} + 0.357{energy} \\ - 0.765{instrumentalness} + 0.062{key1} + 0.053{key5} + 0.073{key6} + 0.079{key7} \\ + 0.096{key8} + 0.060{key10} - 0.074{loudness} + 0.001{tempo} + 0.289{valence} \]

Compared to the full multiple linear regression model, the stepwise multiple regression leaves out the dummy variables key2, key3, key4, and key11, as well as speechiness. All of the variables in the model besides key5 and key10 have a significant linear relationship with the month of song release.
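Under the hood, caret's `"lmStepAIC"` method hands the fit to `MASS::stepAIC`. A self-contained toy sketch of that search (synthetic data, not train1):

```r
library(MASS)  # stepAIC; MASS ships with the standard R distribution

set.seed(123)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), noise = rnorm(200))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(200, sd = 0.5)

# Backward/forward AIC search starting from the full model.
fit.step <- stepAIC(lm(y ~ ., data = d), trace = FALSE)
names(coef(fit.step))  # x1 and x2 survive; the noise column is usually dropped
```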

prediction on test data

# prediction on test data
yhat.step_kcv1 = predict(mlr.step_kcv1$finalModel, newdata=key.dummy(test1))
# RMSE for test data
error.step_kcv1 <- yhat.step_kcv1 - test1$months
rmse.step_kcv1 <- sqrt(mean(error.step_kcv1^2))
data.frame(
  RMSE = RMSE(yhat.step_kcv1, test1$months),
  R2 = R2(yhat.step_kcv1, test1$months)
)
##        RMSE        R2
## 1 0.6070693 0.3329448

Variable Selection: All Subsets Regression (Not using K-folds CV)

set.seed(123)
allreg.models1 <- regsubsets(months ~., data = train1, nvmax = 21)
summary(allreg.models1)
## Subset selection object
## Call: regsubsets.formula(months ~ ., data = train1, nvmax = 21)
## 21 Variables  (and intercept)
##                  Forced in Forced out
## popularity           FALSE      FALSE
## duration             FALSE      FALSE
## acousticness         FALSE      FALSE
## danceability         FALSE      FALSE
## energy               FALSE      FALSE
## instrumentalness     FALSE      FALSE
## key1                 FALSE      FALSE
## key2                 FALSE      FALSE
## key3                 FALSE      FALSE
## key4                 FALSE      FALSE
## key5                 FALSE      FALSE
## key6                 FALSE      FALSE
## key7                 FALSE      FALSE
## key8                 FALSE      FALSE
## key9                 FALSE      FALSE
## key10                FALSE      FALSE
## key11                FALSE      FALSE
## loudness             FALSE      FALSE
## speechiness          FALSE      FALSE
## tempo                FALSE      FALSE
## valence              FALSE      FALSE
## 1 subsets of each size up to 21
## Selection Algorithm: exhaustive
##           popularity duration acousticness danceability energy instrumentalness
## 1  ( 1 )  "*"        " "      " "          " "          " "    " "             
## 2  ( 1 )  "*"        " "      " "          "*"          " "    " "             
## 3  ( 1 )  "*"        " "      " "          " "          " "    " "             
## 4  ( 1 )  "*"        "*"      " "          " "          " "    " "             
## 5  ( 1 )  "*"        "*"      " "          "*"          "*"    " "             
## 6  ( 1 )  "*"        "*"      " "          "*"          "*"    "*"             
## 7  ( 1 )  "*"        "*"      " "          "*"          "*"    "*"             
## 8  ( 1 )  "*"        "*"      "*"          "*"          "*"    "*"             
## 9  ( 1 )  "*"        "*"      "*"          "*"          "*"    "*"             
## 10  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 11  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 12  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 13  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 14  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 15  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 16  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 17  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 18  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 19  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 20  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 21  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
##           key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 loudness
## 1  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 2  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 3  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 4  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 5  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 6  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 7  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 8  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 9  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 10  ( 1 ) " "  " "  " "  " "  " "  " "  "*"  "*"  " "  " "   " "   "*"     
## 11  ( 1 ) " "  " "  " "  " "  " "  " "  "*"  "*"  " "  " "   " "   "*"     
## 12  ( 1 ) "*"  " "  " "  " "  " "  " "  "*"  "*"  " "  " "   " "   "*"     
## 13  ( 1 ) "*"  " "  " "  " "  " "  "*"  "*"  "*"  " "  " "   " "   "*"     
## 14  ( 1 ) "*"  " "  " "  " "  "*"  "*"  "*"  "*"  " "  " "   " "   "*"     
## 15  ( 1 ) "*"  " "  " "  " "  "*"  "*"  "*"  "*"  " "  "*"   " "   "*"     
## 16  ( 1 ) "*"  " "  " "  "*"  "*"  "*"  "*"  "*"  " "  "*"   " "   "*"     
## 17  ( 1 ) "*"  " "  " "  "*"  "*"  "*"  "*"  "*"  " "  "*"   " "   "*"     
## 18  ( 1 ) "*"  " "  " "  "*"  "*"  "*"  "*"  "*"  " "  "*"   "*"   "*"     
## 19  ( 1 ) "*"  "*"  " "  "*"  "*"  "*"  "*"  "*"  " "  "*"   "*"   "*"     
## 20  ( 1 ) "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  " "  "*"   "*"   "*"     
## 21  ( 1 ) "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
##           speechiness tempo valence
## 1  ( 1 )  " "         " "   " "    
## 2  ( 1 )  " "         " "   " "    
## 3  ( 1 )  " "         " "   "*"    
## 4  ( 1 )  " "         " "   "*"    
## 5  ( 1 )  " "         " "   " "    
## 6  ( 1 )  " "         " "   " "    
## 7  ( 1 )  " "         " "   "*"    
## 8  ( 1 )  " "         " "   "*"    
## 9  ( 1 )  " "         "*"   "*"    
## 10  ( 1 ) " "         " "   "*"    
## 11  ( 1 ) " "         "*"   "*"    
## 12  ( 1 ) " "         "*"   "*"    
## 13  ( 1 ) " "         "*"   "*"    
## 14  ( 1 ) " "         "*"   "*"    
## 15  ( 1 ) " "         "*"   "*"    
## 16  ( 1 ) " "         "*"   "*"    
## 17  ( 1 ) "*"         "*"   "*"    
## 18  ( 1 ) "*"         "*"   "*"    
## 19  ( 1 ) "*"         "*"   "*"    
## 20  ( 1 ) "*"         "*"   "*"    
## 21  ( 1 ) "*"         "*"   "*"
allreg.res1 <- summary(allreg.models1)
allreg.compare1 <- data.frame(model = c(1:21),
                              Adj.R2 = allreg.res1$adjr2,
                              CP = allreg.res1$cp)
allreg.compare1
##    model    Adj.R2        CP
## 1      1 0.2662577 419.01033
## 2      2 0.2809405 301.51148
## 3      3 0.2915207 217.13918
## 4      4 0.2984165 162.50349
## 5      5 0.3064961  98.34476
## 6      6 0.3114590  59.32933
## 7      7 0.3153303  29.12450
## 8      8 0.3160799  24.08108
## 9      9 0.3164721  21.91925
## 10    10 0.3169039  19.43997
## 11    11 0.3173029  17.22495
## 12    12 0.3175376  16.33495
## 13    13 0.3178574  14.76005
## 14    14 0.3179760  14.80510
## 15    15 0.3181035  14.77934
## 16    16 0.3181764  15.19312
## 17    17 0.3182498  15.60317
## 18    18 0.3182303  16.76046
## 19    19 0.3181744  18.21054
## 20    20 0.3180635  20.10251
## 21    21 0.3179518  22.00000

Model 13 has the smallest CP value (14.76), with models 14 and 15 essentially tied, and the Adjusted R2 is nearly flat from model 13 onward; the 14-variable model (model 14) is fit below.

set.seed(123)
mlr.allreg1 <- lm(months ~ popularity +duration +acousticness +danceability +energy +instrumentalness+
  key1+key5 +key6 +key7 +key8+loudness + tempo +valence, data = key.dummy(train1))
summary(mlr.allreg1)
## 
## Call:
## lm(formula = months ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key1 + key5 + 
##     key6 + key7 + key8 + loudness + tempo + valence, data = key.dummy(train1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48754 -0.40040  0.05556  0.45256  1.83006 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.0334496  0.1417347  21.402  < 2e-16 ***
## popularity       -0.0199882  0.0005048 -39.599  < 2e-16 ***
## duration          0.0021295  0.0002565   8.303  < 2e-16 ***
## acousticness     -0.1219629  0.0506774  -2.407 0.016132 *  
## danceability      0.6298047  0.0871300   7.228 5.56e-13 ***
## energy            0.3572279  0.0950323   3.759 0.000172 ***
## instrumentalness -0.7678735  0.1209540  -6.348 2.35e-10 ***
## key1              0.0567054  0.0264327   2.145 0.031974 *  
## key5              0.0478245  0.0342038   1.398 0.162103    
## key6              0.0682239  0.0333567   2.045 0.040874 *  
## key7              0.0738130  0.0261831   2.819 0.004833 ** 
## key8              0.0910011  0.0319686   2.847 0.004436 ** 
## loudness         -0.0740407  0.0060210 -12.297  < 2e-16 ***
## tempo             0.0006795  0.0003269   2.078 0.037717 *  
## valence           0.2887193  0.0530373   5.444 5.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6245 on 5489 degrees of freedom
## Multiple R-squared:  0.3197, Adjusted R-squared:  0.318 
## F-statistic: 184.3 on 14 and 5489 DF,  p-value: < 2.2e-16

The Multiple Linear Regression model resulting from all subsets regression for mode 1 songs is written below with coefficients rounded to the thousandths.

\[ log(360 - \hat{months}) = 3.033 - 0.020{popularity} + 0.002{duration} - 0.122{acousticness} + 0.630{danceability} + 0.357{energy} \\ - 0.768{instrumentalness} + 0.057{key1} + 0.048{key5} + 0.068{key6} + 0.074{key7} \\ + 0.091{key8} - 0.074{loudness} + 0.001{tempo} + 0.289{valence} \\ \]

Compared to the full multiple linear regression model, the all subsets multiple regression leaves out the dummy variables key2, key3, key4, key10, and key11, as well as speechiness. All of the variables in the model besides key5 have a significant linear relationship with the month of song release. The model is otherwise very similar to the stepwise result, except that it also removes key10, which was insignificant in the stepwise model.
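Differences between two models' variable sets can be checked with `setdiff()` rather than by eye; a sketch using the stepwise and all subsets selections for mode 1 (names copied from the two summaries above):

```r
# Variable sets of the two mode-1 models, compared programmatically.
vars.step   <- c("popularity", "duration", "acousticness", "danceability",
                 "energy", "instrumentalness", "key1", "key5", "key6",
                 "key7", "key8", "key10", "loudness", "tempo", "valence")
vars.allreg <- setdiff(vars.step, "key10")  # all subsets also drops key10
setdiff(vars.step, vars.allreg)             # "key10"
```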

prediction on test data

# prediction on test data
yhat.allreg1 = predict(mlr.allreg1, newdata=key.dummy(test1))
# RMSE for test data
# error.allreg0 <- yhat.allreg0 - test0$months
# rmse.allreg0 <- sqrt(mean(error.allreg0^2))
data.frame(
  RMSE = RMSE(yhat.allreg1, test1$months),
  R2 = R2(yhat.allreg1, test1$months)
)
##        RMSE        R2
## 1 0.6071187 0.3328339

Regularized Regression: Ridge 10 Fold Cross Validation

set.seed(123)
lm.ridge1 <- train(months ~. , data = train1, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.ridge1$bestTune
##    alpha lambda
## 38     0  0.038
# best coefficient
lm.ridge1.model <- coef(lm.ridge1$finalModel, lm.ridge1$bestTune$lambda)
lm.ridge1.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       3.1431295345
## popularity       -0.0190324693
## duration          0.0020362786
## acousticness     -0.1418149731
## danceability      0.5970842302
## energy            0.2791295082
## instrumentalness -0.6904944050
## key1              0.0657555894
## key2              0.0117781483
## key3              0.0010085253
## key4              0.0500939827
## key5              0.0563973003
## key6              0.0773915151
## key7              0.0810143099
## key8              0.0962618966
## key9             -0.0050702623
## key10             0.0647627234
## key11             0.0284046850
## loudness         -0.0675138250
## speechiness      -0.1577203949
## tempo             0.0006447657
## valence           0.2966971890

The Ridge Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.143 - 0.019{popularity} + 0.002{duration} - 0.142{acousticness} + 0.597{danceability} + 0.279{energy} \\ - 0.690{instrumentalness} + 0.066{key1} + 0.012{key2} + 0.001{key3} + 0.050{key4} + 0.056{key5} \\ + 0.077{key6} + 0.081{key7} + 0.096{key8} - 0.005{key9} + 0.065{key10} + 0.028{key11} \\ - 0.068{loudness} - 0.158{speechiness} + 0.001{tempo} + 0.297{valence} \]

The ridge model retains every variable; none are removed. Instead, ridge shrinks all of the coefficients toward zero.
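The shrinkage can be seen directly from the closed-form ridge solution (X'X + lambda*I)^-1 X'y; a base-R toy sketch (not the glmnet fit above, which additionally standardizes and scales by the sample size):

```r
# Ridge coefficients for a toy problem: the L2 norm of the solution
# shrinks monotonically as lambda grows, but no coefficient hits zero.
set.seed(1)
X <- scale(matrix(rnorm(200), ncol = 2))   # 100 x 2 design, standardized
y <- X %*% c(1, -2) + rnorm(100, sd = 0.1)
ridge <- function(lambda) solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
sum(ridge(50)^2) < sum(ridge(0)^2)  # TRUE: pulled toward zero vs. OLS
```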

# prediction on test data
yhat.ridge1 = predict(lm.ridge1, s=lm.ridge1$bestTune,newdata=test1)
# RMSE for test data
data.frame(
  RMSE = RMSE(yhat.ridge1, test1$months),
  R2 = R2(yhat.ridge1, test1$months)
)
##       RMSE       R2
## 1 0.607864 0.332307

Regularized Regression: Lasso 10 Fold Cross Validation

set.seed(123)
lm.lasso1 <- train(months ~. , data = train1, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.lasso1$bestTune
##   alpha lambda
## 2     1  0.002
# best coefficient
lm.lasso1.model <- coef(lm.lasso1$finalModel, lm.lasso1$bestTune$lambda)
lm.lasso1.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       3.1112735793
## popularity       -0.0198676045
## duration          0.0020253478
## acousticness     -0.1273907684
## danceability      0.6136648633
## energy            0.3289972316
## instrumentalness -0.7308755524
## key1              0.0546850444
## key2              .           
## key3              .           
## key4              0.0330294002
## key5              0.0423364500
## key6              0.0643247639
## key7              0.0712474224
## key8              0.0865202360
## key9             -0.0057034369
## key10             0.0473962432
## key11             0.0135360399
## loudness         -0.0712204240
## speechiness      -0.1174132695
## tempo             0.0006192491
## valence           0.2906117381

The Lasso Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.111 - 0.020{popularity} + 0.002{duration} - 0.127{acousticness} + 0.614{danceability} + 0.329{energy} \\ - 0.731{instrumentalness} + 0.055{key1} + 0.033{key4} + 0.042{key5} \\ + 0.064{key6} + 0.071{key7} + 0.087{key8} - 0.006{key9} + 0.047{key10} + 0.014{key11} \\ - 0.071{loudness} - 0.117{speechiness} + 0.001{tempo} + 0.291{valence} \]

Compared to the full model, the Lasso omits the dummy variables Key2 and Key3, whose coefficients are shrunk exactly to zero.
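This variable selection comes from the Lasso's soft-thresholding of coefficients. A minimal base-R sketch of the soft-threshold operator (the exact Lasso update only under an orthonormal design, an illustrative assumption; the input values are hypothetical):

```r
# Soft-threshold operator behind the Lasso (orthonormal-design case, illustrative)
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Coefficients smaller in magnitude than the penalty are set exactly to zero,
# mirroring how Key2 and Key3 drop out of the model above
soft_threshold(c(-0.50, 0.001, -0.002, 0.30), gamma = 0.002)
## [1] -0.498  0.000  0.000  0.298
```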

# prediction on test data
yhat.lasso1 = predict(lm.lasso1, s=lm.lasso1$bestTune,newdata=test1)
# RMSE and R2 for test data
data.frame(
  RMSE = RMSE(yhat.lasso1, test1$months),
  R2 = R2(yhat.lasso1, test1$months)
)
##        RMSE        R2
## 1 0.6071194 0.3329693

Regularized Regression: Elastic Net 10 Fold Cross Validation

lm.enet1 <- train(months ~. , data = train1, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = seq(0,1,by = 0.05),lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
lm.enet1$bestTune
##     alpha lambda
## 703  0.35  0.003
# best coefficient
lm.enet1.model <- coef(lm.enet1$finalModel, lm.enet1$bestTune$lambda)
lm.enet1.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       3.0654978896
## popularity       -0.0198288623
## duration          0.0020531381
## acousticness     -0.1280776202
## danceability      0.6201027877
## energy            0.3470529576
## instrumentalness -0.7451241876
## key1              0.0653822814
## key2              0.0090165251
## key3              .           
## key4              0.0462348608
## key5              0.0544359122
## key6              0.0758911127
## key7              0.0824078078
## key8              0.0983727026
## key9             -0.0004028006
## key10             0.0601813747
## key11             0.0258287630
## loudness         -0.0728473662
## speechiness      -0.1456355940
## tempo             0.0006702156
## valence           0.2920787913
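The selected alpha of 0.35 mixes the two penalties: glmnet's elastic net penalty is \(\lambda[(1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1]\), so alpha = 0 gives ridge and alpha = 1 gives Lasso. A minimal base-R sketch of the penalty itself (the coefficient vector is hypothetical, not the fitted model):

```r
# Elastic net penalty: a weighted mix of the ridge (L2) and lasso (L1) penalties
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) * sum(beta^2) / 2 + alpha * sum(abs(beta)))
}

beta <- c(0.62, -0.75, 0.29)                      # hypothetical coefficients
enet_penalty(beta, lambda = 0.003, alpha = 0)     # pure ridge penalty
enet_penalty(beta, lambda = 0.003, alpha = 1)     # pure lasso penalty
enet_penalty(beta, lambda = 0.003, alpha = 0.35)  # the selected mixture
```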

The Elastic Net Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.065 - 0.020{popularity} + 0.002{duration} - 0.128{acousticness} + 0.620{danceability} + 0.347{energy} \\ - 0.745{instrumentalness} + 0.065{key1} + 0.009{key2} + 0.046{key4} + 0.054{key5} \\ + 0.076{key6} + 0.082{key7} + 0.098{key8} + 0.060{key10} + 0.026{key11} \\ - 0.073{loudness} - 0.146{speechiness} + 0.001{tempo} + 0.292{valence} \] Here only Key3 is shrunk exactly to zero; the Key9 coefficient (\(-0.0004\)) is retained but rounds to zero at this precision, so it is omitted from the written equation.

# prediction on test data
yhat.enet1 = predict(lm.enet1, s=lm.enet1$bestTune, newdata=test1)
# RMSE and R2 for test data
data.frame(
  RMSE = RMSE(yhat.enet1, test1$months),
  R2 = R2(yhat.enet1, test1$months)
)
##        RMSE        R2
## 1 0.6073717 0.3323557

Determining the Best Model and Interpreting It.

We will use the Root Mean Square Error (RMSE) and the R-Squared value to assess the performance of these models. RMSE is calculated as \[ \sqrt {\frac{1}{n} \sum^{n}_{i=1}(y_i - \hat {y_i})^2} \] where \(y_i\) are the observed values for month of release since January 1992, \(\hat {y_i}\) are the predicted month values, and \(i = 1, \dots , n\) with \(n = 1835\) for the test data of mode 1 songs. Note that both metrics here are computed on the transformed scale, \(log(360 - months)\).
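A minimal base-R sketch of these two metrics, equivalent in spirit to caret's `RMSE()` and `R2()` helpers used throughout (the observed/predicted vectors below are hypothetical):

```r
# RMSE: root mean squared prediction error
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# R-squared as the squared correlation between observed and predicted values
r2 <- function(y, yhat) cor(y, yhat)^2

y    <- c(3.1, 2.7, 4.8, 3.9)   # hypothetical observed log(360 - months) values
yhat <- c(3.0, 2.9, 4.5, 4.0)   # hypothetical predictions
rmse(y, yhat)
r2(y, yhat)
```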

results1 <- data.frame(Model = c("FullMLR", "Stepwise 10CV", "AllSubsets","Ridge 10CV", "Lasso 10CV", "Elastic Net 10CV"),
                       RMSE = round(c(0.6078257, 0.6070693,0.6071187,0.607864,0.6073883, 0.6073717),5),
                       R2 = round(c(0.3312429, 0.3329448, 0.3328339,0.332307,0.3322821, 0.3323557), 5))
results1
##              Model    RMSE      R2
## 1          FullMLR 0.60783 0.33124
## 2    Stepwise 10CV 0.60707 0.33294
## 3       AllSubsets 0.60712 0.33283
## 4       Ridge 10CV 0.60786 0.33231
## 5       Lasso 10CV 0.60739 0.33228
## 6 Elastic Net 10CV 0.60737 0.33236

The model with the smallest RMSE (on the \(log(360 - months)\) scale) and the largest R-Squared value is the Multiple Linear Regression fit by the stepwise method with 10 fold cross validation.

Stepwise regression is the best model.
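The winner can also be pulled programmatically from the comparison table; a small base-R check, re-entering the rounded values from the table above:

```r
# Model comparison table (rounded values from the results above)
results1 <- data.frame(
  Model = c("FullMLR", "Stepwise 10CV", "AllSubsets",
            "Ridge 10CV", "Lasso 10CV", "Elastic Net 10CV"),
  RMSE  = c(0.60783, 0.60707, 0.60712, 0.60786, 0.60739, 0.60737),
  R2    = c(0.33124, 0.33294, 0.33283, 0.33231, 0.33228, 0.33236)
)

# The stepwise fit wins on both criteria
results1[which.min(results1$RMSE), ]
results1[which.max(results1$R2), ]
```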

summary(mlr.step_kcv1$finalModel)
## 
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key1 + key5 + 
##     key6 + key7 + key8 + key10 + loudness + tempo + valence, 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4842 -0.4012  0.0563  0.4534  1.7748 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.0299902  0.1417423  21.377  < 2e-16 ***
## popularity       -0.0199787  0.0005048 -39.580  < 2e-16 ***
## duration          0.0021218  0.0002565   8.272  < 2e-16 ***
## acousticness     -0.1239002  0.0506910  -2.444 0.014548 *  
## danceability      0.6302868  0.0871225   7.234 5.31e-13 ***
## energy            0.3566622  0.0950243   3.753 0.000176 ***
## instrumentalness -0.7653059  0.1209561  -6.327 2.70e-10 ***
## key1              0.0616367  0.0266563   2.312 0.020799 *  
## key5              0.0529274  0.0343880   1.539 0.123832    
## key6              0.0732353  0.0335388   2.184 0.029034 *  
## key7              0.0786565  0.0264008   2.979 0.002902 ** 
## key8              0.0959522  0.0321543   2.984 0.002857 ** 
## key10             0.0597350  0.0419649   1.423 0.154663    
## loudness         -0.0739348  0.0060209 -12.280  < 2e-16 ***
## tempo             0.0006867  0.0003269   2.101 0.035717 *  
## valence           0.2887687  0.0530324   5.445 5.40e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6244 on 5488 degrees of freedom
## Multiple R-squared:   0.32,  Adjusted R-squared:  0.3181 
## F-statistic: 172.1 on 15 and 5488 DF,  p-value: < 2.2e-16

\[ log(360 - \hat{months}) = 3.030 - 0.020{popularity} + 0.002{duration} - 0.124{acousticness} + 0.630{danceability} + 0.357{energy} \\ - 0.765{instrumentalness} + 0.062{key1} + 0.053{key5} + 0.073{key6} + 0.079{key7} \\ + 0.096{key8} + 0.060{key10} - 0.074{loudness} + 0.001{tempo} + 0.289{valence} \]

Converting back to the original scale gives \[ 360 - \hat{months} = \frac{exp(3.030 + 0.002{duration} + 0.630{danceability} + 0.357{energy} + 0.062{key1} + 0.053{key5} + 0.073{key6} + 0.079{key7} + 0.096{key8} + 0.060{key10} + 0.001{tempo} + 0.289{valence})}{exp(0.020{popularity} + 0.124{acousticness} + 0.765{instrumentalness} + 0.074{loudness} )} \] so \(\hat{months}\) is 360 minus this ratio.
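The back-transformation can be verified numerically with a minimal base-R round trip, using the same transform established earlier in the analysis (the months values are drawn from the observed range in the data summary):

```r
# Forward transform used throughout the analysis: log(360 - months)
months     <- c(3, 243, 285, 349)   # values within the observed range (min/quartiles/max)
log_months <- log(360 - months)

# Back-transform: invert the log, then subtract from 360
back <- 360 - exp(log_months)
back   # recovers the original months values
```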